Re: Patent submission
Tom, 

The attached paper will be submitted to a conference with a publication date of early August 1999; the
official release forms will follow in a few days. After discussions with several people, we think that the two
ideas mentioned below should be covered by patent(s), and we are sending you this information so that you can
evaluate them.



Thank you 



Hans Peter Graf



The attached paper contains two aspects that should be considered for coverage by patents. 
Combining 3D modeling with sample-based rendering 

We developed a new technique for generating photo-realistic images and animations by combining 3D 
modeling with sample-based rendering (described mainly in sections 3 and 5). 

Synthesizing scenes that look like real photographs is an extremely difficult problem and is one of the 
hottest research topics in computer graphics today. The main approach is to refine 3D models to the point 
where they look very similar to real-world objects. Yet compute requirements are growing exponentially as
finer and finer levels of detail need to be modeled. Recently sample-based rendering has been gaining interest as
an alternative for generating photorealistic scenes. This technique starts from photos and synthesizes new
scenes by integrating fragments of these photos. The main drawbacks of sample-based rendering are the
large amount of data that needs to be recorded and a lack of flexibility in generating new scenes.
Our new approach overcomes the problems of previous techniques by combining the flexibility of 3D 
modeling with the photo-realism of sample-based rendering. The paper describes this technique for 
generating talking heads, yet the concept is not limited to heads and faces, but is very general and 
applicable to any 3D object.
Perceptual parameters of texture maps

One of the main problems in generating photo-realistic scenes is finding texture maps appropriate for
integration into a new scene. To solve this problem we developed a new technique for describing the
perceptual appearance of texture maps [briefly described in 5.3]. We can now search photographs 
automatically for segments with a certain appearance and generate databases of texture maps. For 
generating a new scene one specifies the desired appearance with a few perceptual parameters and can 
recall the texture maps from a database. 

The new technique applies a combination of geometric characterizations plus filtering and morphological 
operations to describe the appearance of a texture map. 

The two techniques mentioned above have many potential applications in computer graphics, animation, 
and image coding. The talking-head animations described in the paper are just one example. Another 
potential application is image compression, which relies more and more on symbolic descriptions of scenes 
(for example, in MPEG4 faces can be encoded in this way). The decoder then synthesizes the image from a 
few transmitted parameters. Such techniques rely on generating photorealistic scenes.
Abstract

This paper describes a system for creating a photo-realistic 
model of the head that can be animated and lip-synched from a
string of phonemes. Combined with a state-of-the-art text-to-speech
synthesizer, it generates video animations of talking
heads that closely resemble real people. 

To obtain a natural-looking head, we choose a 'data-driven'
approach. We record a talking person and apply image
recognition to automatically extract bitmaps of facial parts.
These bitmaps are normalized and parameterized before being
entered into a database. For the synthesis, we start from a 
phoneme string and calculate motion trajectories for all the 
facial parts and the whole head. These trajectories provide the 
parameters for selecting the proper bitmaps from the database. 
Smoothing and blending are applied to these 'strings' of bitmaps to
eliminate hard transitions and create a seamless animation for 
each facial part. 

A simple 3D model of the head guides the blending of the 
bitmaps into a whole head with a given pose. This model follows
only the rigid movements of the head and, unlike traditional 3D
models, does not deform with plastic deformations of the face;
it only serves as a guide for extracting and combining bitmaps
from/to an image. For example, the mouth area is modeled with
six planes: one for the lips, two for the cheeks and three for the
jaw. Because the recorded bitmaps only incur minor
deformations, due to the slight warps associated with rotations,
their original appearance is preserved. The result is a talking
head that resembles very closely the person who was originally 
recorded. 

Talking-head animations of this type are useful as a front
end for agents and avatars in such applications as virtual
operators, help desks, educational and expert systems. 

Keywords 

Facial animation, computer vision, image-based rendering. 

1 INTRODUCTION 

Animated characters and in particular talking heads are 
playing an increasingly important role in computer interfaces. An 
animated talking head immediately attracts the attention of a
user, can make a task more engaging and adds entertainment
value to an application. Seeing a face makes many people feel
more comfortable interacting with a computer. For learning tasks
several researchers report that animated characters can increase
the attention span of the user, and hence improve learning results.
When used as avatars, lively, talking heads can make an
encounter in a virtual world more engaging. Today such heads
are usually either recorded video clips of real people or cartoon
characters lip-synching synthesized text.

Often a cartoon character or robot-like face may do, yet we
respond to nothing as strongly as to a real face. For an
educational program, for example, a real face is preferable. A
cartoon face is associated with entertainment, not to be taken too
seriously. An animated face of a competent teacher, on the other 
hand, can create an atmosphere conducive to learning and 
therefore increase the impact of such educational software. 

Generating animated talking heads that look like real people 
is a very challenging task, and so far all synthesized heads are 
still far from reaching this goal. To be considered natural a face 
has to be not just photo-realistic in appearance, but must also 
exhibit realistic head movements, emotional expressions, and 
proper plastic deformations of the lips synchronized with the
speech. We have been trained since birth to recognize faces and facial
expressions and therefore are highly sensitive to the slightest 
imperfections in a talking face. 

Instead of modeling a human head in minute detail, we start 
from photographic images of a person's face and generate 
animated sequences from these samples. This sample-based 
modeling approach preserves a high level of detail in the 
appearance of the face. By recording real movements of the head 
and of the lips and reusing them for the synthesis we obtain a 
model that is able to produce realistic lip and head movements, as 
well as emotional expressions. 

Section 3 defines how the head and its facial parts are 
modeled and the process of capturing sample data. In order to
accurately capture realistic speech postures, we have subjects
speak short text sequences in front of the camera. A face
recognition system then automatically analyzes this video
footage and selects the proper samples as described in section 4.
Section 5 presents the process of extracting bitmaps from video
frames and how they are normalized and parameterized for
easy access in a database. Finally, the synthesis of the talking-
head animation driven by a string of phonemes is described in 
section 6. 

2 PREVIOUS WORK 

Many different systems exist for modeling the human head 
[8], achieving various degrees of photo-realism and flexibility,
but relatively few have demonstrated a complete talking-head 
functionality. 

2.1 3D head modeling 

Most approaches use 3D meshes to model in fine detail the 
shape of the head [11][12]. These models are created using
advanced 3D scanning techniques such as a CyberWare range
scanner [19] or are adapted from generic models using either
optical flow constraints [14] or facial feature labeling [3][4].
Some of them include information on how to move vertices
according to physical properties of the skin and the underlying
muscles [19]. To obtain a natural appearance they typically use
images of a person that are texture-mapped onto the 3D model.
Yet, when plastic deformations occur, the texture images are
distorted, resulting in visible artifacts. Another difficult problem
is the modeling of hair and such surface features as grooves and
wrinkles. These are important for the appearance of a face and
yet are only marginally, if at all, modeled by most of these
systems. The incredible complexity of plastic deformations in 
talking faces makes precise modeling extremely difficult. 
Simplifications of the models result in unnatural appearances and 
synthetic looking faces. 

2.2 Morphing 2D views 

An alternative approach is based on morphing between 2D
images. These techniques can produce photo-realistic images of
new shapes by interpolating between two existing shapes.
Morphing of a face requires precise specifications of the
displacements of many points in order to guarantee that the
results look like real faces. Most techniques therefore rely on a
manual specification of the morph parameters [16]. Beymer et al.
[15] and Bichsel [17] have proposed image analysis methods
where the morph parameters are determined automatically, based
on optical flow. While this approach gives an elegant solution to
generating new views from a set of reference images, one still
has to find the proper reference images. Moreover, since the
techniques are based on 2D images the range of expressions and 
movements they can produce is rather limited. 

2.3 Sample-Based synthesis 

Recently there has been a surge of interest in sample-based 
techniques (also referred to as data-driven) for synthesizing 
photo-realistic scenes. These techniques generally start by 
observing and collecting samples that are representative of a 
signal we wish to model. The samples are then parameterized so 
that they can be recalled at synthesis time. Typically samples are
processed as little as possible to avoid distortions. One of the
early successful applications of this concept is QuickTime VR®
[20]. This system allows panoramic viewing of scenes as well as
examining objects from all angles. Samples are parameterized by 
the direction from where they were recorded and stored in a two- 
dimensional database. 

Recently other researchers explored ways of sampling both
texture and 3D geometry of faces [3][4], producing impressive
animations of facial expressions. These systems use multiple 
cameras or facial markers to derive the 3D geometry and texture 
of the face in each frame of video sequences. Deriving the exact 
geometry of such details as grooves, wrinkles, lips, and tongue as 
they undergo plastic deformations remains, however, difficult. 
Extensive manual measuring in the images is required, resulting 
in a labor-intensive capture process. Textures are processed
extensively to match the underlying 3D model and may lose
some of their natural appearance. These systems have not yet 
been demonstrated for speech reproduction. 

2.4 Talking-Head systems 

A talking-head synthesis technique based on recorded 
samples that are selected automatically has been proposed by 
Bregler et al. [13]. This system can produce videos of real people
uttering text they never actually said. It uses video snippets of
triphones (3 subsequent phonemes) as samples. Since these video
snippets are parameterized with the phoneme sequence, the 
resulting database is very large. Moreover, this parameterization 
can only be applied to the mouth area, precluding the use of other 
facial parts such as eyes and eyebrows that are carrying important 
conversational cues. 

Ezzat et al. [5] have demonstrated a sample-based talking
head system that uses morphing to generate intermediate
appearances of mouth shapes from a very small set of manually
selected mouth samples. While morphing generates smooth
transitions between mouth samples, this system does not model
the whole head and does not synthesize head movements and
facial expressions. Cosatto et al. [6] presented a sample-based
talking head that uses several layers of 2D bit-planes as a model. 
Neither facial parts nor the whole head are modeled in 3D and 
therefore the system is limited in what new expressions and 
movements it can synthesize. 

3 MODEL 

3.1 Definition 

In defining our model of the head, we attempt to combine the
flexibility of 3D models with the realism of images. A key 
problem with sample-based techniques is to control the number 
of image samples that need to be recorded and stored. A face's 
appearance changes due to talking, emotional expressions and 
head orientation, leading to a combinatorial explosion in the 
number of different appearances. To keep the number of samples 
at a manageable level we divide the face into a hierarchy of parts 
and model each part independently. This results in a compact 
model that can create animations with head movements, speech 
articulation and different emotional expressions. 

Our face model is defined as follows: 

1. Hierarchy of parts: The head is separated into a 'base face'
(Figure 1a) and a number of facial parts. The base face
covers the area of the whole face, serving as a substrate onto
which the facial parts are integrated. The facial parts are:
mouth with cheeks, jaw, eyes, and forehead with eyebrows
(Figure 1b). Nose and ears are not modeled separately, but
are part of the base face.

2. 3D model: The shape of each facial part is approximated
with a small number of planes. This set of planes is used as
a guide to map the facial parts onto the base face in a given
pose (Figure 1c, d). The positions and orientations of these
planes follow the movements of the head, yet their shapes
remain constant, even when the corresponding facial parts
undergo plastic deformations. Hence, a model plane is more
like a local window onto which a facial part is projected
than a polygon of a traditional 3D model. 

3. Sample bitmaps: For each facial part, sample bitmaps are 
recorded that cover the range of possible appearances 
produced by plastic deformations. No separate bitmaps are
recorded to account for different head orientations. For the
base face, bitmap samples are recorded with the head in
different orientations. The range of head rotations we 
consider at the moment is ±15°. 

There is no unique way of decomposing a face into parts, and 
no part of the face is truly independent from the rest. Muscles 
and skin are highly elastic and tend to spread a deformation in 
one place across a large part of the whole face. The
decomposition described here was chosen after studying how
facial expressions are generated by humans [25], and how they
are depicted by artists and cartoonists [24]. 

To generate a face with a certain mouth shape and emotional 
expression, the proper bitmaps are chosen for each of the facial 
parts. The head orientation is known from the base face, so that 
we can project the bitmaps onto the base face using simple 
warping [23]. This operation is similar to traditional texture
mapping. The difference from traditional 3D modeling techniques
is that for plastic deformations we select different bitmaps, rather
than trying to squeeze one single bitmap into any new shape.
Only rigid movements such as rotation and translation of the
whole head plus the rotation of the jaw are modeled. The bitmaps
of the facial parts are integrated into the base face with proper
feathering (alpha-blending at the edges), so that they blend
smoothly into the base face, without introducing artifacts (Figure
1f).
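
The feathering step can be illustrated with a minimal sketch. It assumes a linear ramp of the alpha value over a few pixels at the border of the facial-part bitmap; the ramp shape, the border width and the names feather_mask and blend_part are illustrative choices, not details given in the text.

```python
import numpy as np

def feather_mask(height, width, border):
    """Alpha mask: 1 in the interior, ramping linearly to 0 at the edges."""
    ys = np.minimum(np.arange(height), np.arange(height)[::-1])
    xs = np.minimum(np.arange(width), np.arange(width)[::-1])
    # Distance (in pixels) of every pixel to the nearest bitmap edge.
    dist = np.minimum(ys[:, None], xs[None, :])
    return np.clip(dist / float(border), 0.0, 1.0)

def blend_part(base, part, top, left, border=4):
    """Return a copy of `base` with `part` alpha-blended in at (top, left)."""
    h, w = part.shape[:2]
    alpha = feather_mask(h, w, border)
    if part.ndim == 3:               # broadcast over color channels
        alpha = alpha[:, :, None]
    out = base.astype(float)         # astype makes a copy
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = alpha * part + (1.0 - alpha) * region
    return out
```

With this mask the outermost pixels of the part keep the base-face value, so no hard edge appears where the two bitmaps meet.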

We consider here only a limited range of frontal views that 
are typical for movements during spontaneous speech. Then we 
do not need bitmaps recorded with different head orientations for 
the facial parts. Empirical studies showed that for a range of ±15°
of rotation, warping does not introduce serious distortions in the 
bitmaps that would be considered unnatural. We limit the 
discussion here to a range that can be covered with a single set of 
bitmaps and model planes as shown in figure 1. The model can, 
however, be adapted to cover a wide range of orientations. 
Sample bitmaps have to be recorded from different angles, and 
the planes of the model need to be adjusted. 

3.2 Capturing sample bitmaps 

A head model is instantiated in two steps. First a few 
measurements are made on the subject's face to determine its
geometry, namely the relative positions of eye corners, nostrils,
mouth corners and the bottom of the chin. Using these
measurements, the model planes are adapted for each facial part.
Since this is done only once, there is little incentive to automate
it. Techniques exist, such as the one described in [14], that can
adapt a generic head model from video sequences showing head 
movements. This may be useful if only video footage exists 
without the person being present. 

Once the 3D model is defined, each face part is populated 
with bitmaps representative of its appearances. A person is
recorded while freely speaking a few short sentences, to get all
the different mouth shapes. For the examples shown here the lady
spoke 14 phrases, each two to three seconds long. We try to keep
the capture process as simple and non-intrusive as possible, since
we are interested in capturing the typical head movements during
speech as well as special mimics and unique ways of articulating
words. In particular, we avoid any head restraints or forced pose,
such as requiring the subject to look constantly in a given
direction. Guenter et al. [3] present a sophisticated technique that
involves gluing dozens of fluorescent dots on the subject's face.
Later, image processing is able to remove the dots. Thanks to
robust recognition algorithms (see section 4) we do not need any
special markers on the subject's face, but rather exploit the
natural richness of features of the face. We also avoid the need of 
multiple cameras. Knowing the positions of a few points in the 
face allows recovery of the head pose using techniques described 
in section 4. 

Lip movements can be extremely fast, which may cause
blurry images when the frame rate is not high enough. Recording
60 fields per second instead of 30 frames, or using a shutter, can
solve this problem without having to resort to expensive cameras.
We adjust luminance and hue of the facial parts and the base
face, so that they will blend seamlessly. By making sure that the 
illumination is reasonably homogeneous one can avoid excessive 
color corrections that may introduce artifacts. Moreover, having a 
background of uniform and neutral color makes finding the 
location of the head easy. We currently capture frames of 
560x480 pixels in size with the head being about four fifths of 
this height, ensuring a high level of fidelity in rendering details of 
facial features, skin and hair. 

Quite some effort has gone into developing the whole system 
in a way that the capturing process remains easy and cheap. 
Eventually the system should be usable outside the lab by 
relatively unskilled personnel, or even at home by the user 
himself. The final goal is to have an easy procedure where you 
can quickly produce an animated head of yourself. 

4 RECOGNITION 

Sample-based synthesis of talking heads depends on a 
reliable and accurate recognition of the face and the positions of 
the facial features. Without an automatic technique for extracting 
and normalizing facial features, a manual segmentation of the 
images has to be done. Considering that we need samples of all 
lip shapes, of different head orientations and of several emotional 
expressions, thousands of images have to be searched for the 
proper shapes. If we also want to analyze the lip movements 
during transitions between phonemes, we have to analyze 
hundreds of thousands of images. Clearly, it is not feasible to do 
such a task manually. 

The main challenge for the face recognition system is the high 
precision with which the facial features have to be located. An
error as small as a single pixel in the position of a feature distorts 
the pose estimation of the head noticeably. To achieve such a 
high precision our analysis proceeds in three steps, each with an 
increased resolution. The first step finds a coarse outline of the 
head plus estimates of the positions of the major facial features. 
In the second step the areas around the mouth, the nostrils and 
the eyes are analyzed in more detail. The third step, finally, 
zooms in on specific areas of facial features, such as the corners 
of the eyes, of the mouth and of the eyebrows and measures their 
positions with high accuracy.

4.1 Locating the face 

In a first step the whole image is searched for the presence of 
heads, and their locations are determined. Each frame is analyzed 
with two different algorithms. The first type of analysis is a color 
segmentation to find the areas with skin colors and colors
representative of the hair.

The second type of analysis segments the image based on
textures and shapes. This analysis uses only the luminance of the
image. First, the image is filtered with a band-pass filter,
removing the highest and lowest spatial frequencies (figure 2a).
Then a morphological operation followed by adaptive
thresholding results in a binary image where areas of facial 
features are marked with blobs of black pixels (figure 2b). 
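
As an illustration of this texture analysis, the sketch below chains a band-pass filter, a morphological operation and adaptive thresholding. The specific choices are assumptions made for the example: a difference of two box blurs as the band-pass filter, grey-level erosion as the morphological operation, and a threshold relative to the local mean. The filter radii and the offset are likewise hypothetical values.

```python
import numpy as np

def box_blur(img, radius):
    """Mean filter over a (2*radius+1)^2 window; edges are replicated."""
    k = 2 * radius + 1
    padded = np.pad(img.astype(float), radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def band_pass(lum, small_radius=1, large_radius=6):
    """Difference of two blurs: drops the finest detail and the slow shading."""
    return box_blur(lum, small_radius) - box_blur(lum, large_radius)

def grey_erode(img, radius=1):
    """Grey-level erosion (local minimum): grows dark regions."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.full(img.shape, np.inf)
    for dy in range(k):
        for dx in range(k):
            out = np.minimum(out, padded[dy:dy + img.shape[0], dx:dx + img.shape[1]])
    return out

def adaptive_threshold(img, radius=8, offset=10.0):
    """True where a pixel is darker than its neighborhood mean by > offset."""
    return img < box_blur(img, radius) - offset

def feature_blobs(luminance):
    """Binary image in which candidate facial-feature areas are True."""
    return adaptive_threshold(grey_erode(band_pass(luminance)))
```

A dark, compact region on a brighter background (such as an eye or a nostril) comes out as a connected blob, while uniform areas stay empty.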

The color analysis as well as the texture analysis produce sets 
of features. Combinations of these features are evaluated with 
classifiers, testing their shapes and relative positions. For 
example, an area marked by the color analysis as a candidate of a 
face area is combined with candidates of eye areas produced by 
the texture analysis. If relative sizes and positions match closely
those of a reference face, this combination is evaluated further
and combined with other features. Otherwise it is discarded.

In order to save computation time, the analysis starts with
simple representations, and only if the result is not satisfactory, a 
more complex representation is used. For example, when a 
classifier tries to determine whether three features represent two 
eyes and a mouth, it takes in a first pass only the center of mass 
of each feature into account and measures their relative positions. 
If the results are ambiguous, the analysis is repeated looking also 
at the shape of each feature, using the outlines of each connected
component in the image. 
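
The first, cheap pass of such a classifier might look like the following sketch. It uses only the centers of mass of three candidate blobs and returns "ambiguous" when a second pass on the shapes would be needed. The reference ratio and the two tolerances are hypothetical values chosen for the example, as is the specific geometric test (mouth distance below the eye midpoint relative to the eye distance).

```python
import numpy as np

def center_of_mass(mask):
    """(x, y) centroid of the True pixels of a binary blob."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def first_pass(left_eye, right_eye, mouth,
               ref_ratio=1.0, accept_tol=0.15, reject_tol=0.5):
    """Classify three blobs as eyes plus mouth from their centroids only.

    Returns "accept", "reject" or "ambiguous". The thresholds are
    illustrative assumptions, not values from the paper.
    """
    left, right, low = (center_of_mass(m) for m in (left_eye, right_eye, mouth))
    eye_dist = np.linalg.norm(right - left)
    if eye_dist == 0:
        return "reject"
    drop = low - (left + right) / 2.0
    if drop[1] <= 0:                 # image y grows downward: mouth must be below
        return "reject"
    deviation = abs(np.linalg.norm(drop) / eye_dist - ref_ratio)
    if deviation <= accept_tol:
        return "accept"
    if deviation >= reject_tol:
        return "reject"
    return "ambiguous"               # caller repeats the test using blob outlines
```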

This bottom-up approach of evaluating combinations of 
features reliably and quickly produces the location of the head as
well as the positions of the major facial features (figure 2c). In
the videos used for extracting samples, there is usually only one 
person present and the lighting is fairly uniform. Moreover, the 
background is static with little texture. Locating the head in 
images of such quality is rather easy and can even be done at a 
strongly reduced resolution without losing reliability. Typically,
images are down-sampled to a quarter or a ninth of the original 
size for this analysis in order to speed it up. 

4.2 Locating facial features 

Finding the exact dimensions of the facial features is more 
challenging, since the person being recorded is moving the head
and is changing facial expressions while speaking. This can lead 
to great variations in the appearance of a facial feature and can 
also affect the lighting conditions. For example, during a nod a 
shadow may fall over the eyes. Therefore, the analysis described 
above does not always produce accurate results for all facial 
features and we need to analyze further the areas around eyes, 
mouth, and the lower end of the nose. 

The algorithm proceeds by analyzing the color space, 
periodically retraining it with a small number of frames. For 
example, the area around the mouth is cut out from five frames 
and with a leader-clustering algorithm the most prominent colors 
in the area are identified. By analyzing the shapes of the color 
segments we can assign the colors to different parts, such as the
mouth cavity, the teeth and the lips (figure 2d, 2e, 2f). By
repeating the color calibration periodically, we keep track of 
changes in the appearances of the facial features. 
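
A leader-clustering pass over the pixels of the mouth area can be sketched as below. Each pixel joins the first existing cluster whose leader color lies within a distance threshold, and otherwise starts a new cluster. The Euclidean color distance and the threshold value are assumptions for the example; the text does not specify them.

```python
import numpy as np

def leader_cluster(pixels, threshold):
    """Single-pass leader clustering of colors.

    pixels: iterable of color vectors, e.g. frame.reshape(-1, 3).
    Returns (leader_color, count) pairs, most populated cluster first.
    """
    leaders, counts = [], []
    for p in pixels:
        p = np.asarray(p, dtype=float)
        for k, leader in enumerate(leaders):
            if np.linalg.norm(p - leader) <= threshold:
                counts[k] += 1
                break
        else:                         # no leader close enough: start a new cluster
            leaders.append(p)
            counts.append(1)
    order = np.argsort(counts)[::-1]
    return [(leaders[k], counts[k]) for k in order]
```

The most populated clusters then serve as the prominent colors that are assigned to lips, teeth or mouth cavity by looking at the shapes of the resulting segments.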

The texture analysis is also adapted to the particular facial 
feature under investigation by adjusting the filter parameters to 
the size and shape of a feature. In this way, the combination of 
texture and color analysis produces reliable measurements of the 
positions and outlines of the facial features. 

Errors made by the system are of two types. The first type is a
complete failure to identify a facial feature and the second type is 
inaccuracy in the measurements. A failure to identify a feature 
happens in about 1% of the frames, mostly when the head moves
over a wider range than what was seen in the training images.
The accuracy achieved for the dimensions of the mouth is
typically ±2 pixels (standard deviation), where the width of the
mouth is around 100 pixels. More details on this face analysis
system can be found in [18]. It has been tested in a large number
of lip reading experiments, analyzing lip shapes of 50 different
people pronouncing over 5,000 utterances, recorded under
varying lighting conditions [10].

4.3 High accuracy feature points 

For measuring the head pose, a few points in the face have to
be measured with high accuracy, preferably with an error of less
than one pixel. The techniques described above tend to produce 
variations of, for example, ±2 pixels for the eye corners. Filtering 
over time can improve these errors significantly, yet a more 
precise measurement is still preferable. 

We therefore add a third level of analysis to measure a few 
feature points with the highest accuracy. From a training set of 
300 frames a few representative examples of one feature point 
are selected. For example, for measuring the position of the left 
lip corner, nine examples are selected (figure 2i). These samples
are chosen based on the dimensions of the mouth (figure 2h).
This means that the training procedure selects mouth images with
three different widths and three different heights. From those
images the areas around the left corner are cut out. For analyzing
a new image, one of these sample images is chosen, namely the
one where the mouth width and height are most similar, and this
kernel is scanned over an area around the left half of the mouth.

To measure the similarity between the kernel and the area 
being analyzed, both are filtered with a high-pass filter before 
multiplying them pixel by pixel (figure 2j). This convolution 
identifies very precisely where a feature point is located (figure 
2k). The standard deviation of the measurements is typically less 
than one pixel for the eye corners and filtering over time reduces
the error to less than 0.5 pixels. 
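
The kernel scan can be sketched as follows. Both the kernel and the analyzed area are high-pass filtered, and the kernel is then slid over the area while the pixel-by-pixel products are summed. The high-pass filter used here (image minus a 3x3 local mean) is an assumption for the example, and the plain double loop is meant to show the cost structure rather than to be fast.

```python
import numpy as np

def box_blur(img, radius):
    """Mean filter over a (2*radius+1)^2 window; edges are replicated."""
    k = 2 * radius + 1
    padded = np.pad(img.astype(float), radius, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def high_pass(img, radius=1):
    """Remove the local mean so that only fine structure remains."""
    return img - box_blur(img, radius)

def locate(area, kernel):
    """Return ((row, col), scores): best top-left kernel offset and score map."""
    a, k = high_pass(area), high_pass(kernel)
    kh, kw = k.shape
    rows, cols = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    scores = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            scores[r, c] = np.sum(a[r:r + kh, c:c + kw] * k)
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(v) for v in best), scores
```

The work per call is proportional to the number of kernel pixels times the number of scanned positions, which is the scaling behavior discussed next.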

The time required for this operation scales with the kernel size 
times the size of the analyzed area. We found empirically that a 
kernel size of 20x20 pixels provides adequate robustness and
analyzing an area of 100x100 pixels takes less than 100 ms on a
300MHz PC. 

The features we measure are mouth, nostrils, eyes and
eyebrows. Knowing the positions and shapes of these features is
sufficient to identify visemes of the mouth and the most
prominent emotional expressions. Sometimes the interior of the
mouth is also analyzed to get a better measure of lip protrusion
and stress put on the lips. 

4.4 Pose estimation 

We apply a pose estimation technique reported in [21], using 
six feature points in the face: the four eye corners and the two
nostrils. This technique starts with the assumption that all model
points lie in a plane parallel to the image plane (this corresponds to an
orthographic projection of the model into the image plane plus a
scaling). Then, by iteration, the algorithm adjusts the model
points until their projections into the image plane coincide with
the observed image points. The pose of the 3D model is obtained
by iteratively solving the following linear system of equations:

\[
\begin{aligned}
M_i \cdot \frac{f}{Z_0}\,\mathbf{i} &= x_i\,(1 + \varepsilon_i) - x_0 \\
M_i \cdot \frac{f}{Z_0}\,\mathbf{j} &= y_i\,(1 + \varepsilon_i) - y_0
\end{aligned}
\]

$M_i$ is the position of object point $i$. $\mathbf{i}$ and $\mathbf{j}$ are the two first
base vectors of the camera coordinate system in object
coordinates, $f$ is the focal length and $Z_0$ is the distance of the
object origin from the camera. $\mathbf{i}$, $\mathbf{j}$, and $Z_0$ are the unknown
quantities to be determined. $(x_i, y_i)$ is the scaled orthographic
projection of the model point $i$, $(x_0, y_0)$ is the origin of the model
in the same plane, and $\varepsilon_i$ is a correction term due to the depth of the
model point. $\varepsilon_i$ is the parameter that is adjusted in each iteration
until the algorithm converges. This algorithm is very stable, even
with measurement errors, and it converges in just a few
iterations.
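
An iteration of this kind can be sketched as below. The sketch follows the commonly published scaled-orthographic scheme (often called POSIT); that it matches the method of [21] in every detail is an assumption. The first model point is taken as the object origin, image coordinates are measured relative to the image center in the same units as the focal length, and the model points must not all lie in one plane. The function name and the stopping tolerance are illustrative.

```python
import numpy as np

def posit(model, image, focal, iterations=30, tol=1e-12):
    """Estimate pose from N >= 4 non-coplanar model points.

    model: (N, 3) object points; model[0] is the object origin.
    image: (N, 2) observed image points relative to the image center.
    Returns (rotation, translation); rows of rotation are i, j, k.
    """
    model = np.asarray(model, dtype=float)
    image = np.asarray(image, dtype=float)
    vectors = model[1:] - model[0]        # M_i relative to the origin
    pinv = np.linalg.pinv(vectors)        # least-squares solve of the linear system
    eps = np.zeros(len(vectors))          # start: all points at the origin's depth
    x0, y0 = image[0]
    for _ in range(iterations):
        big_i = pinv @ (image[1:, 0] * (1.0 + eps) - x0)   # (f / Z0) * i
        big_j = pinv @ (image[1:, 1] * (1.0 + eps) - y0)   # (f / Z0) * j
        scale = 0.5 * (np.linalg.norm(big_i) + np.linalg.norm(big_j))
        i = big_i / np.linalg.norm(big_i)
        j = big_j / np.linalg.norm(big_j)
        k = np.cross(i, j)
        z0 = focal / scale
        new_eps = vectors @ k / z0        # depth correction term per point
        converged = np.max(np.abs(new_eps - eps)) < tol
        eps = new_eps
        if converged:
            break
    rotation = np.vstack([i, j, k])
    translation = np.array([x0, y0, focal]) / scale
    return rotation, translation
```

Each pass only multiplies by a precomputed pseudo-inverse, which is why a handful of iterations is cheap even when run on every frame.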

4.5 Errors 

If the recognition module has failed to identify eyes or 
nostrils on a given frame, we simply ignore that frame during the 
model creation process. The recognition module marks the inner
and outer corners of both eyes, as well as the center of the
nostrils. The location of the nostrils is very reliable and robust.
We are able to derive their position with sub-pixel accuracy by
applying low-pass filtering on their trajectories. The location of
the eye corners is less reliable because their positions change
slightly during closures. We ignore frames on which the eyes are
closed. The errors in the filtered positions of these feature points
are typically less than one pixel. A study of the errors in the pose
resulting from errors of the recognition is shown in Table 1. All
possible combinations of recognition errors are calculated for a
given perturbation (with 6 points and 9 possible errors per point,
all 9^6 = 531441 poses have been computed).

Table 1: The values shown in the table are the maximum
errors in the calculated pose (x, y, z angles in degrees and
distance to camera in mm) for perturbations of the measured
feature points by 0.5, 1, 1.5 and 2 pixels. The last column
shows an average error for a perturbation of 1 pixel. The
subject was at a distance of 1 m from the camera. The camera
focal length was 15 mm and its resolution 560x430 pixels.
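
The exhaustive perturbation study summarized in Table 1 can be sketched as follows. The sketch assumes that the "9 possible errors" of a point are the displacements by -d, 0 or +d pixels in x and in y; the text does not spell this out. The `estimate` argument stands for any function that turns a set of image points into a pose vector and is a placeholder here.

```python
import itertools
import numpy as np

def offsets(delta):
    """The 9 assumed displacements of one point: -delta, 0, +delta in x and y."""
    steps = (-delta, 0.0, delta)
    return [(dx, dy) for dx in steps for dy in steps]

def perturbations(points, delta):
    """Yield every combination of per-point displacements (9**N arrays)."""
    points = np.asarray(points, dtype=float)
    for combo in itertools.product(offsets(delta), repeat=len(points)):
        yield points + np.array(combo)

def max_error(points, delta, estimate, reference):
    """Largest absolute deviation of estimate(perturbed points) from reference."""
    worst = np.zeros_like(np.asarray(reference, dtype=float))
    for perturbed in perturbations(points, delta):
        worst = np.maximum(worst, np.abs(estimate(perturbed) - reference))
    return worst
```

For six feature points the generator produces 9**6 = 531441 point sets, one pose computation each, which matches the count given above.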

5 SAMPLES OF FACIAL FEATURES 

5.1 Unit 

There are two choices in selecting the unit of the samples:
either single images or short sequences of images. Bregler et al.
[13] use video sequences of triphones as the basic sample unit.
This results in large databases but allows a semantically
meaningful parameterization and requires fewer samples for
synthesizing a new sequence. We use both units: single frames,
to keep the database size low, and short sequences where they are
clearly advantageous for the animation. For facial parts, we use
mostly single images. Since the recognition module provides 
extensive information about the shape of facial features it is 
possible to parameterize them reliably. For cases where the 
appearance of the facial part cannot be properly described by the 
chosen parameters, e.g. a smiling mouth, we store short 
sequences labeled with their appearance. For the base face we 
also store short sequences of typical head movements. 

5.2 Normalization 

Before the image samples are entered into the database they
are corrected in shape and scale to compensate for the different
head orientations when they were recorded. From the recognition
module the position and shape of facial parts as well as the pose
of the whole head are known (figure 3a). To extract facial parts
from the images we first project the planes of the 3D model into
the image plane (figure 3b). The projected planes then mark the
extent of each facial part (figure 3c). These areas are
"un-warped" into normalized bitmaps (figure 3d). Any information
about the shape produced by the recognition module is also
mapped into the normalized view and stored along with the bitmap
in a data structure. For example, the recognition module produces
the outline of the lips encoded as a sequence of points. All these
points are mapped into the normalized plane before entering
them into the database.

5.3 Parameterization 

Once all samples of a face part are extracted from the video
sequences and normalized, they need to be labeled and sorted in a
way that they can be retrieved efficiently. To parameterize a
facial part we choose some of the measurements produced by the
recognition module. In figure 3e, for example, we parameterize
the mouth with three parameters: the width (the distance
between the two corner points), the y-position of the upper lip
(the y-maximum of the outer lip contour) and the y-position of
the lower lip (the y-minimum of the outer lip contour). Samples
of other facial parts are parameterized in a similar way.

Besides geometric features, we also use parameters describing
the appearance of a facial part. The filtering processes described
in section 4.1 provide a convenient way of characterizing the
texture of a sample. By filtering a bitmap with a band-pass filter
and measuring the intensity in three or four frequency bands, we
obtain a characterization of the texture that can be used to
parameterize the samples. In this way, we can differentiate
between samples that have the same geometrical dimensions, but
a different visual appearance.

The space defined by the parameters of a face part is
quantized at regular intervals. This creates an n-dimensional grid,
where n is the number of parameters (figure 3f), and each grid
point represents a particular shape. First the numbers of intervals
on the axes of the grid are chosen, then all samples are scanned 
and the distribution of each parameter analyzed. Based on this 
information the exact positions of grid intervals are set. 
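
As an illustration of this quantization step, the following Python
sketch places the grid positions of one axis and maps a sample to its
grid index. The quantile rule is an assumption; the text only states
that positions are set from the analyzed distribution. All function
names are hypothetical.

```python
def grid_positions(values, num_intervals):
    """Place num_intervals grid positions along one parameter axis.

    Assumption: positions sit at evenly spaced quantiles of the observed
    values, so each interval holds a similar share of the samples.
    """
    ordered = sorted(values)
    positions = []
    for i in range(num_intervals):
        q = (i + 0.5) / num_intervals  # midpoint quantile of the i-th slice
        positions.append(ordered[min(int(q * len(ordered)), len(ordered) - 1)])
    return positions


def nearest_index(positions, value):
    """Index of the grid position closest to value on one axis."""
    return min(range(len(positions)), key=lambda i: abs(positions[i] - value))


def quantize(sample, axes):
    """Map an n-parameter sample to its n-dimensional grid index."""
    return tuple(nearest_index(pos, v) for pos, v in zip(axes, sample))
```

With a mouth described by width, upper-lip and lower-lip position,
axes would hold three lists of positions and quantize would return a
3-tuple of indices.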

5.4 Database 

Searching through all samples, we now populate each grid
point with the k closest candidate bitmaps. Three parameters
govern the size l of the resulting database of samples: the
number of parameters n, the number of intervals p_i on each
axis of the parameter space, and k, the number of samples kept at
each grid point:

$$l = k \prod_{i=0}^{n-1} p_i$$

Having multiple samples per grid point, i.e. k > 1, is useful
for several reasons. In the "debugging" phase of the database an
operator can choose the best of a small set of automatically
selected samples. Another reason to keep multiple samples is that 
such expressions as a smile or putting pressure on the lips 
produce visually different mouth shapes for one set of 
parameters. One could increase the dimensionality of the 
parameter space, yet this would increase the number of samples 
drastically. By selectively populating grid points with more than 
one sample one can cover such cases more efficiently. 
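
A minimal sketch of populating the grid, assuming that "closest" means
Euclidean distance in parameter space (the text does not name the
metric). The database size follows from the three governing
quantities: k times the product of the interval counts.

```python
from itertools import product
from math import dist, prod


def populate(samples, axes, k=1):
    """Return {grid_index: [ids of the k closest samples]}.

    samples: {sample_id: parameter tuple}
    axes:    one list of grid positions per parameter
    Assumption: closeness is Euclidean distance in parameter space.
    """
    database = {}
    for index in product(*(range(len(axis)) for axis in axes)):
        point = tuple(axis[i] for axis, i in zip(axes, index))
        ranked = sorted(samples, key=lambda s: dist(samples[s], point))
        database[index] = ranked[:k]
    return database


def database_size(intervals, k=1):
    """Number of stored samples: k times the product of interval counts."""
    return k * prod(intervals)
```

For example, database_size((4, 4, 3), k=1) gives the number of grid
points of a 4 by 4 by 3 grid with one sample each.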
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There is a trade-off between the size of the database and the 
quality of the animation that it can generate. Reducing the
number of parameters will decrease the precision with which a
sample can be characterized and result in a poor selection of
samples. Reducing the number of intervals means bigger
differences between neighboring samples and therefore the need
to synthesize more transition samples that are of lower visual
quality. In our example the mouth shape is characterized with 3
parameters, dividing them into 4, 4 and 3 intervals, respectively,
resulting in a database of 48 mouth samples. About 40 additional
samples are necessary to store the remaining facial parts. Each
sample is about 5KB (compressed using JPEG), resulting in
about 1/2MB of storage. We also need short sequences for the
base face totaling about 2MB (compressed using MPEG2).
Hence we have a very compact database of a little over 2MB that
produces high-resolution (560x480 pixels) animations. By
scaling down the resolution we can generate animations from a
database of a few hundred kilobytes.

5.5 Errors 

Up to this point, the construction of the database was done 
fully automatically. But since no recognition system is 100%
accurate, some erroneous samples will be included in the 
database. Errors of the recognition module include alignment 
errors and selection errors. Alignment errors are due to errors in 
the position of features, producing misaligned samples. Selection 
errors happen when parameters are not measured accurately. It is 
also possible that, for a given image, the parameters chosen do 
not characterize the appearance with sufficient accuracy. For 
example, two lip shapes may have the same parameters, yet look 
different because in one image the speaker puts more pressure on 
the lips. Such effects are corrected by synthesizing short 
animation sequences and verifying visually that they look 
smooth. 

We developed a graphical interface allowing an operator to 
browse through the database, correct angles and positions of any 
sample and select new samples from a list of candidates if the 
sample at one grid point is not adequate. The system also tells the 
user which phonemes are mapped to the currently selected mouth 
viseme. This is useful to avoid articulation problems in the
animation. By looking at short animations one can verify the 
visual continuity of the samples in the database. With this tool an 
operator creates a database in less than an hour. 

6 ANIMATION 

6.1 Trajectories 

Choosing n parameters to describe a sample creates an
n-dimensional space of possible appearances. An animation
produces a parametric function (or trajectory) through this space
with time as parameter (figure 4).

Figure 4b shows the resulting trajectory in the three
dimensional space of mouth parameters for the utterance: "I bet
that". All the parameter values are given in Table 4c. To create a
video animation at 30 frames per second, the trajectory is 
sampled every 33.33 milliseconds. Then for each sample point 
the closest grid entry and its associated bitmap is chosen. The 
parameters describing feature shapes are chosen so that 
transitions between neighboring samples look smooth. This 
guarantees that the resulting animation is also visually smooth. 
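
The sampling step can be sketched as follows. The trajectory is
represented here by a hypothetical function from time to a parameter
tuple, and the squared Euclidean distance used to find the closest
grid entry is an assumption.

```python
def sample_times(duration_s, fps=30):
    """Frame times in seconds, one every 1/fps (about 33.33 ms at 30 fps)."""
    return [i / fps for i in range(int(duration_s * fps) + 1)]


def closest_grid_entry(point, grid):
    """Key of the grid entry nearest to point.

    grid: {grid_index: parameter tuple}
    Assumption: nearness is squared Euclidean distance.
    """
    return min(grid,
               key=lambda g: sum((a - b) ** 2 for a, b in zip(grid[g], point)))


def quantize_trajectory(trajectory, duration_s, grid, fps=30):
    """Turn a continuous trajectory into a string of grid indices.

    trajectory is a hypothetical callable: time in seconds -> parameter tuple.
    """
    return [closest_grid_entry(trajectory(t), grid)
            for t in sample_times(duration_s, fps)]
```

Each returned index identifies the bitmap shown in the corresponding
video frame.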




for each substring in string {
    if (substring.length < minlength and
        substring is a plosive)
        substring is enlarged using
        surrounding substrings;
    else if (substring.length < minlength)
        substring takes the value of
        surrounding substring;
}

Definition: substring = consecutive samples of one
viseme

Table 2: Example of the rule based algorithm used to filter
abrupt transitions arising from quantizing of a trajectory
into a string of samples.
Nevertheless the string of samples is the result of quantization, 
and without some minor filtering, quantization errors might result 
in abrupt transitions or visible artifacts. We use a rule-based 
filtering algorithm to eliminate these artifacts. The example in 
Table 2 illustrates the kind of rules that are used. 
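
The two rules of Table 2 can be made concrete in Python. This is a
sketch under stated assumptions, not the filter actually used: the
table does not say which neighbor a short run borrows from, so here a
short plosive run takes frames from the following run first, then the
preceding one, and only from runs longer than min_length; a short
non-plosive run takes the label of the preceding run.

```python
from itertools import groupby


def filter_string(samples, min_length=2, plosives=frozenset()):
    """Apply the two rules of Table 2 to a list of viseme labels.

    The output has the same number of frames as the input.
    """
    # Run-length encode: [label, number of consecutive frames].
    runs = [[label, len(list(group))] for label, group in groupby(samples)]
    for i, (label, length) in enumerate(runs):
        if length >= min_length or len(runs) == 1:
            continue
        if label in plosives:
            # Rule 1: enlarge the plosive using surrounding runs
            # (assumption: following run first, then preceding run).
            for j in (i + 1, i - 1):
                while (0 <= j < len(runs)
                       and runs[i][1] < min_length
                       and runs[j][1] > min_length):
                    runs[j][1] -= 1
                    runs[i][1] += 1
        else:
            # Rule 2: the short run takes the value of a surrounding run
            # (assumption: the preceding one, or the next at the start).
            runs[i][0] = runs[i - 1][0] if i > 0 else runs[i + 1][0]
    return [label for label, length in runs for _ in range(length)]
```

For instance, a single-frame non-plosive viseme between two longer
runs is absorbed, while a single-frame plosive is stretched so that
the lip closure stays visible.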

6.2 Transitions 

To smooth transitions between samples further, we 
synthesize transition samples between two existing samples by 
blending them together using the following equation: 
$$\mathrm{pix} = \alpha \cdot \mathrm{pix}_a + (1 - \alpha) \cdot \mathrm{pix}_b, \qquad t \in [t_0, t_1]$$

During the transition interval from t_0 to t_1, the resulting pixel
pix is a blend of the corresponding pixels from sample a (pix_a)
and sample b (pix_b). The number of samples that are used to
create a transition varies depending on the sampling rate of the 
trajectory and the duration of the samples. When the database 
contains few samples, the visual difference between samples is 
larger and more sophisticated techniques such as morphing [5] 
provide better results. Morphing is, however, computationally 
more expensive and requires correspondence points. When the 
visual difference between samples is reasonably low, the simpler, 
cheaper blending technique is adequate. Figure 4d shows the 
sequence of mouth shapes selected from the database, plus the 
transition shapes, marked with a green T. 
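
A minimal sketch of the blending step on grayscale bitmaps stored as
lists of rows. The linear ramp of the blending weight over the
transition is an assumption; the text only gives the blend itself.

```python
def blend(pix_a, pix_b, alpha):
    """Per-pixel blend alpha * a + (1 - alpha) * b of two equal-size bitmaps."""
    return [[alpha * a + (1.0 - alpha) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(pix_a, pix_b)]


def transition(pix_a, pix_b, steps):
    """Return `steps` intermediate bitmaps between sample a and sample b.

    Assumption: alpha falls linearly from 1 (all a) towards 0 (all b);
    the end points themselves are the original samples and are not repeated.
    """
    return [blend(pix_a, pix_b, 1.0 - i / (steps + 1))
            for i in range(1, steps + 1)]
```

With steps=1 the single transition bitmap is the equal-weight average
of the two samples.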

6.3 Mouth 

We animate the mouth of the talking head model from a
string of phonemes. Each phoneme is mapped to its visual
equivalent, a viseme (mouth sample). To account for
coarticulation¹ we use a model described in [22].

Instead of directly mapping a phoneme to a viseme, we
derive each parameter of a viseme, v_p, from a sequence of
phonemes, where each phoneme has a target value V_p0 and a
decay function g(t). The decay function is an exponential
function describing the influence a particular parameter has on
its neighbors. The value of A is the span over which coarticulation
is considered and corresponds to about 300 milliseconds. v_p,
the value of parameter p at time t, defines a trajectory in the
parameter space of the mouth shapes (figure 4b).

¹ Coarticulation refers to the effect that the lip shape is determined
not only by the phoneme uttered at a time but also by previous and
subsequent phonemes.
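
To illustrate how targets and exponential decay can combine into a
parameter trajectory, here is a generic Python sketch. It is not the
model of [22]: the decay-weighted average, the decay rate, and the
reading of the 300 millisecond span as a window on either side of the
current time are all assumptions made for illustration.

```python
from math import exp


def decay(dt, rate=10.0):
    """Exponential decay g(t); the rate (per second) is a made-up value."""
    return exp(-rate * abs(dt))


def parameter_value(t, phonemes, span=0.3, rate=10.0):
    """Value of one mouth parameter at time t (in seconds).

    phonemes: list of (center_time, target_value) pairs.
    Assumption: the value is the decay-weighted average of the targets
    of all phonemes whose center lies within `span` seconds of t.
    """
    weighted = total = 0.0
    for center, target in phonemes:
        if abs(t - center) <= span:
            w = decay(t - center, rate)
            weighted += w * target
            total += w
    return weighted / total if total else 0.0
```

Evaluating parameter_value once per frame for each mouth parameter
yields one point of the trajectory per frame.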

6.4 Other facial parts 

We handle the animation of other facial parts using a model 
similar to the one developed for the MPEG4 facial animation 
subsystem [7]. Special markers are put in the text to control
amplitude, length, onset and offset of facial animations. This is 
an easy way to provide synchronization of conversational cues, 
such as eye and eyebrow movements, eye blinks or head 
movements that accompany the spoken text. 

6.5 Rendering 

A frame of the final animation can be generated when
bitmaps of all the face parts have been retrieved from the
database. The bitmap of the base face is first copied into the
frame buffer, then the bitmaps of face parts are projected onto the
base face using the 3D model and the pose. At the moment we
consider only a limited range of rotation angles of ±15° so that
there is no need for hidden surface removal. To avoid any
artifacts from overlaying bitmaps, we use gradual blending or
"feathering" masks. These masks are created by ramping up a
blending value from the edges towards the center. These
operations are implemented using basic OpenGL calls and the
whole frame is rendered with just a few texture-map operations,
which makes it possible to render the talking head in real time on
a low cost PC.
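
The feathering idea can be shown with a small CPU-side sketch in
Python. It is not the OpenGL implementation mentioned above; the
linear ramp and the ramp width are assumptions, and bitmaps are
grayscale lists of rows.

```python
def feather_mask(width, height, ramp):
    """Blending weights in [0, 1]: 0 at the border, rising to 1 at the center.

    Assumption: the weight grows linearly over `ramp` pixels from the
    nearest edge.
    """
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            edge = min(x, y, width - 1 - x, height - 1 - y)
            row.append(min(1.0, edge / ramp))
        mask.append(row)
    return mask


def overlay(base, part, mask, left, top):
    """Blend `part` into `base` in place, with its corner at (left, top)."""
    for y, (part_row, mask_row) in enumerate(zip(part, mask)):
        for x, (p, m) in enumerate(zip(part_row, mask_row)):
            b = base[top + y][left + x]
            base[top + y][left + x] = m * p + (1.0 - m) * b
    return base
```

Because the weight is zero on the border of the part, the overlaid
bitmap fades into the base face instead of ending at a visible seam.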

6.6 Text-To-Speech synthesizer 

The whole animation is driven by the output of the text-to-
speech synthesizer (TTS). Starting from ASCII text input plus
some annotation controlling the intonation, the TTS produces a
sound file. In addition it also outputs a phonetic transcription.
This includes precise timing information for each phoneme plus
some information about the stress. The animation module
translates this information into a sequence of visemes (see figure
4c). The stress information can be used to guide facial
expressions and head movements.

Since we are striving for a natural appearance of our face, we
were searching for a TTS that sounds natural. Most TTS today
produce speech that has a distinctly robot-like sound. Only very
recent progress in speech synthesizer technology has produced
speech that can be considered natural sounding [2].

7 RESULTS AND DISCUSSION

We have produced short videos from two different head
models, one female and one male. Figure 4 shows a few frames
extracted from such a video. It is, of course, impossible to judge
the quality of an animation from still pictures; this only shows
that statically these frames look natural with no noticeable
artifacts. To obtain feedback on the quality of these sequences,
we have made informal tests with dozens of people. All tests
were done with short clips, without integrating them into an
application or trying to make them particularly entertaining (such
as telling a joke). In such a setting the viewers concentrate fully
on the talking head and notice any artifacts. While reactions are
mostly positive, some viewers criticize the lip synchronization
and the articulation - often over-articulation. Occasionally
blending artifacts at the teeth and the jaw are visible.

A formal test was done to determine whether a talking head
could improve intelligibility of spoken text in a noisy
environment [1]. Two head models were tested: one 3D model,
with and without texture maps of a real person, and this sample-
based head. All head models improved intelligibility significantly
and by about the same amount. These tests were done with an
older version of the sample-based talking head and all heads used
an older TTS [8], which has more of a robot-like voice.
Subjective tests indicated that users dislike the clash between a
natural looking face and an unnatural voice (in contrast, a
synthetic looking head with a natural voice seems to be perfectly
acceptable). Therefore, in some of the tests the bare 3D model
scored higher in 'being liked' than either the sample-based head
or the 3D head with a person's texture map. The new TTS [2]
sounds very natural and fits well the appearance of the sample-
based head. In recent tests there have never been any complaints
about a mismatch between voice and face.

A long-term goal is to produce animation snippets that 
cannot be distinguished from a real person or at least to make the 
animations look so good that a viewer accepts them as a 
replacement for video clips of a person. In order to make a 
talking head a valuable addition to an application, it is not only 
its appearance that must be of very high quality. To keep a 
viewer pleased, the talking head must have a wide repertoire of 
behaviors, blending discreetly into the flow of action of the 
surrounding application. 

7.1 Coarticulation

The coarticulation model was originally developed to study
speech production. The resulting mouth shapes tend to over-
articulate certain phonemes, giving some of the animations an
unnatural look. It is a generic model adapted to a particular
subject by adjusting many parameters. Yet with such a synthetic
model it is difficult to capture the details of how a person is
articulating speech. We are in the process of converting this
generic model to a data-driven model. To accomplish this we
record videos of commonly spoken sequences of diphones,
triphones and quadriphones. We then extract and normalize the
trajectory of each lip parameter during articulation of the
phonemes. Even though the number of these trajectories can be
large, the size of each trajectory amounts to only a few hundred
bytes, therefore resulting in a compact database. To synthesize
new articulations of speech, the appropriate phoneme sequences
are identified in the coarticulation database and are concatenated.

7.2 Model 

The simple 3D model we currently use covers a limited range 
of views. This is because it approximates facial parts with only a
few planes, resulting in visible artifacts when the head is rotated
beyond about ±15° of the original sample's angle. To circumvent
this limitation we plan to augment the model with new sets of 
samples that are extracted under different views. In this way a 
wider range of possible views can be covered by switching 
between sets of samples depending on the pose of the base head. 

Emotional expressions are generated mostly through 
animations of the upper part of the face or when there is no 
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talking. More samples of mouth shapes will be added where the 
person is, for example, smiling while talking. 

8 CONCLUSIONS 

We have presented a novel way to create head models that 
can be used to generate photo-realistic talking-head animations.
Using image samples captured while a subject was speaking 
preserves the original appearance. Image analysis techniques 
make it possible to compute the pose of the head and measure 
facial parts on tens of thousands of video frames, resulting in a 
rich, yet compact database of samples. A simple 3D model of the 
head and facial parts enables perspective projection of the 
samples onto a base head in a given pose, allowing head
movements. The results are lively animations with a pleasing 
appearance that resemble closely a real person. This system looks 
promising for generating talking heads that can enliven 
computer-user interfaces as well as future encounters among 
avatars in cyberspace. 

Recent advances in accuracy and robustness of face 
recognition systems make the approach described here feasible. 
Combining machine vision with computer graphics is an idea that 
is receiving increasing attention recently [9]. Using photographs 
as parts of computer graphics is an old tradition. Yet as long as
photographs had to be segmented manually, this approach was 
very costly. As image analysis algorithms mature, more and more 
general scenes can be segmented automatically, opening a whole 
new world of possibilities for synthesizing photorealistic scenes.

Figure 1: Model of the talking-head: the base face (a) and the normalized facial parts (b). The head pose in the base face [...]
model of the head are used to project the facial parts (c). (d) shows a strongly rotated head pose to illustrate the 3D s[...]
parts; this pose is rotated too much to be blended into a base face without artifacts. In (e) the projected facial parts an[...]
base face (with the 3D wireframe removed (f)). The same facial parts are projected onto a different base face (g).




Figure 2: Recognition process: A frame is first filtered using a combination of bandpass and morphological filters [...]
image is thresholded, connected blobs are extracted, and their shapes analyzed (b). Using a model of the head, the shape [...]
scored and the best scoring combination is kept. This marks the positions of the main facial features (c). Knowing t[...]
mouth, a color analysis is performed to find the outline of the lips (d) (e) (f). Then the lips are measured (h). (i) sh[...]
convolution kernels used to find the exact position of the mouth corner. One of them is selected (yellow box in (i)) an[...]
the mouth corner (filtered first) (j) is convolved with this kernel. The result of this operation provides a very precise loc[...]
corner (k). A similar analysis is performed for the eye corners and the corners of the eyebrows (n).

Figure 3: Populating the database of samples: From the recognized eyes and nostrils (a), the pose of the head is calculat[...]
The 3D model is used to mark the areas of the facial parts (c), which are extracted and normalized (un-projected) (d). Featur[...]
measured on the sample for parameterization (e) and the space of samples is populated (f). The samples in (g) show a subspace wh[...]
upper-lip parameter is constant.



Figure 4: (a) Input text: "Wonderful, I bet that sounds better!" [...] synthesizer provides a string of phonemes and their timing [...]
(c: [...] col.). Then for each frame, parameters that describe the mouth shapes are computed using coarticulation (c: columns 4, 5 a[...]
[...] trajectory in the 3 dimensional space of m[...]
[...] database grid (in this case a 4 by 4 by 3 grid) (c: 7th col.). This string [...] grid ind[...]
[...] movements (c: 8th col.). Transitions are inserted to smooth out the animation [...]
[...] (green T indicates the transition bitmaps). Finally mouth shapes [...]
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DATE: 




TO: Ann Taylor 

Outside Counsel Coordinator 

RE: IDSNo. <^Td — QO</2^ 



Case Name: 



DIRECT MANAGED CASE 



g ™4^ZZ^i^~ jhj^ by .he 

It i S ^ I ?5 1 / nd ^ ^jjtys submission be assigned to the firm of 

' Hftaff WfrJf Mfa If possible, sh0 , 

vmtf the application. 

□ This submission is already assigned to the firm of



□ This application is already/should be (circle one) assigned to the firm of 
■ and the status of the U.S. Case is: 



Authorized 

Provisional filed on 

Pending 

Amendment due out on 
Patented 



Response to Office Action in Foreign Country (_ 

due out on 

□ Other: " 



Please note that the Case Folders are- 
El Attached herewith 
(I J£U.S. □ Foreign C 

□ Already in Middletown 

□ MIA (missing) 

□ Other: 



SPECIAL COMMENTS: 



3&£ 



Attorney/secy 




4/27/99 arof 
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=^p Disclosure No\ <3oOo -opV^ 0 ur Ref. No. ATT- 0 <&e 

Jftjf Assigned Attorney: Uenrl, Lu. K^n £Sz 
Mim Office Address: PO^** 

ph0Re N °- MO-a^-niai 

Expected Filing Date ifloloo 
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